The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a very historic moment, scientifically and analytically (we have a large amount of data being shared across the globe and analyzed in real time). The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by its source. I realized early on that, depending on the source, the reported number of cases conflicted. Originally I was using JHU data, but it was always ‘ahead’ of Our World in Data. I also noticed that JHU’s website was buggy: you clicked on the U.S. stats but they didn’t reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World in Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Mon Jun 29 14:04:49 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the ‘analysis’ but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Mon Jun 29 14:04:49 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name      description
prediction_model   outputs case estimates for given log-linear model parameters (slope and intercept)
make_long          converts input data to long format (specialized cases)
name_overlaps      outputs the column-name intersection and set differences of two data frames
find_linear_index  finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b        # model is in log10 space: log10(cases) = m*days + b
  total_cases<-10^total_cases.log  # back-transform to predicted case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
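Since the fitted model lives in log10 space, the prediction has to be back-transformed before reporting case counts. A minimal Python sketch of the same idea (the slope/intercept values below are made up for illustration):

```python
def prediction_model(m=1.0, b=0.0, days=(1,)):
    """Log-linear model: log10(cases) = m * days + b.
    Back-transform with 10** to get predicted case counts."""
    return [{"days": d, "total_cases": 10 ** (m * d + b)} for d in days]

# With slope 0.1 and intercept 2 (i.e. 100 cases on day 0),
# predicted cases multiply 10-fold every 10 days.
preds = prediction_model(m=0.1, b=2.0, days=(0, 10, 20))
```

The back-transform is the step that is easy to drop: returning `m * days + b` directly would hand back log-counts mislabeled as counts.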

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it to long format, using date and total cases as variable/value. Also enforces standardization/assumes data structure naming by using fixed variable name, value name, and id.vars
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------
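The same wide-to-long reshape can be sketched with pandas; the frame below is a toy stand-in for the JHU layout (region names and counts are invented):

```python
import pandas as pd

# Wide JHU-style frame: one row per region, one column per date
wide = pd.DataFrame({
    "Country.Region": ["US", "Italy"],
    "case_type": ["total", "total"],
    "X6.27.20": [2500000, 240000],
    "X6.28.20": [2510000, 240310],
})

# Equivalent of make_long(): id columns stay fixed, date columns stack
long = wide.melt(
    id_vars=["Country.Region", "case_type"],
    var_name="Date",
    value_name="Total_confirmed_cases",
)
```

Each region/date pair becomes one row, which is the shape ggplot-style plotting and per-group summaries expect.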

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
i<-intersect(names(df1),
names(df2))
sd1<-setdiff(names(df1),
names(df2))
sd2<-setdiff(names(df2),names(df1))
cat("intersection:\n",paste(i,"\n"))
cat("in df1 but not df2:\n",paste(sd1,"\n"))
cat("in df2 but not df1:\n",paste(sd2,"\n"))
return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many of the last `running_avg` second-derivative values are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-running_avg+1),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Find the date `running_avg` days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
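The stability test above can be expressed compactly with plain lists. A Python sketch of the same idea, working in indices rather than dates, with the 0.2 tolerance exposed as a parameter (both simplifications are mine, not the R code's):

```python
import math

def find_linear_index(cases, window=5, tol=0.2):
    """Return the index where log2(cases) first becomes linear, i.e. where
    the second difference stays within `tol` of zero for `window` steps."""
    log_cases = [math.log2(c) for c in cases]
    first = [b - a for a, b in zip(log_cases, log_cases[1:])]    # growth rate
    second = [b - a for a, b in zip(first, first[1:])]           # change in growth rate
    run = 0
    for i, d in enumerate(second):
        run = run + 1 if abs(d) < tol else 0
        if run == window:
            # second[i] involves cases[i + 2]; step back by `window`,
            # mirroring the R code's `date - running_avg` offset
            return i + 2 - window
    return None

doubling = [100 * 2 ** k for k in range(10)]       # constant growth rate: linear in log space
accelerating = [2 ** (k * k) for k in range(10)]   # growth rate keeps rising: never stabilizes
stable_at = find_linear_index(doubling)
never = find_linear_index(accelerating)
```

For perfectly exponential data the second differences are all zero, so the run completes immediately; for accelerating growth it never does.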

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.28.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.28.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.28.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.28.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US death file but not in the other datasets. When combining, all datasets will have this column, but US deaths is the only one where it is populated.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU has a gross file format: it's wide, with each column being a date in MM/DD/YY. So read this in as raw data but transform it to be better suited for analysis.
# Furthermore, the world- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # leading 0 read in as X
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # ends in .20 and not 2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
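The date cleanup above works around R's column-name mangling (headers starting with a digit get an `X` prefix). A simplified Python illustration of the same parse; it strips the prefix outright rather than substituting a leading zero, since `%m` tolerates unpadded months:

```python
import re
from datetime import date, datetime

def parse_jhu_column(colname):
    """A header like '6/28/20' is read into R as 'X6.28.20'
    (names can't start with a digit). Undo that and parse the date."""
    cleaned = re.sub(r"^X", "", colname)
    return datetime.strptime(cleaned, "%m.%d.%y").date()

parsed = parse_jhu_column("X6.28.20")
```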

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 159 159
Albania 159 159
Algeria 159 159
Andorra 159 159
Angola 159 159
Antigua and Barbuda 159 159
Argentina 159 159
Armenia 159 159
Australia 1272 1272
Austria 159 159
Azerbaijan 159 159
Bahamas 159 159
Bahrain 159 159
Bangladesh 159 159
Barbados 159 159
Belarus 159 159
Belgium 159 159
Belize 159 159
Benin 159 159
Bhutan 159 159
Bolivia 159 159
Bosnia and Herzegovina 159 159
Botswana 159 159
Brazil 159 159
Brunei 159 159
Bulgaria 159 159
Burkina Faso 159 159
Burma 159 159
Burundi 159 159
Cabo Verde 159 159
Cambodia 159 159
Cameroon 159 159
Canada 2226 2226
Central African Republic 159 159
Chad 159 159
Chile 159 159
China 5247 5247
Colombia 159 159
Comoros 159 159
Congo (Brazzaville) 159 159
Congo (Kinshasa) 159 159
Costa Rica 159 159
Cote d’Ivoire 159 159
Croatia 159 159
Cuba 159 159
Cyprus 159 159
Czechia 159 159
Denmark 477 477
Diamond Princess 159 159
Djibouti 159 159
Dominica 159 159
Dominican Republic 159 159
Ecuador 159 159
Egypt 159 159
El Salvador 159 159
Equatorial Guinea 159 159
Eritrea 159 159
Estonia 159 159
Eswatini 159 159
Ethiopia 159 159
Fiji 159 159
Finland 159 159
France 1749 1749
Gabon 159 159
Gambia 159 159
Georgia 159 159
Germany 159 159
Ghana 159 159
Greece 159 159
Grenada 159 159
Guatemala 159 159
Guinea 159 159
Guinea-Bissau 159 159
Guyana 159 159
Haiti 159 159
Holy See 159 159
Honduras 159 159
Hungary 159 159
Iceland 159 159
India 159 159
Indonesia 159 159
Iran 159 159
Iraq 159 159
Ireland 159 159
Israel 159 159
Italy 159 159
Jamaica 159 159
Japan 159 159
Jordan 159 159
Kazakhstan 159 159
Kenya 159 159
Korea, South 159 159
Kosovo 159 159
Kuwait 159 159
Kyrgyzstan 159 159
Laos 159 159
Latvia 159 159
Lebanon 159 159
Lesotho 159 159
Liberia 159 159
Libya 159 159
Liechtenstein 159 159
Lithuania 159 159
Luxembourg 159 159
Madagascar 159 159
Malawi 159 159
Malaysia 159 159
Maldives 159 159
Mali 159 159
Malta 159 159
Mauritania 159 159
Mauritius 159 159
Mexico 159 159
Moldova 159 159
Monaco 159 159
Mongolia 159 159
Montenegro 159 159
Morocco 159 159
Mozambique 159 159
MS Zaandam 159 159
Namibia 159 159
Nepal 159 159
Netherlands 795 795
New Zealand 159 159
Nicaragua 159 159
Niger 159 159
Nigeria 159 159
North Macedonia 159 159
Norway 159 159
Oman 159 159
Pakistan 159 159
Panama 159 159
Papua New Guinea 159 159
Paraguay 159 159
Peru 159 159
Philippines 159 159
Poland 159 159
Portugal 159 159
Qatar 159 159
Romania 159 159
Russia 159 159
Rwanda 159 159
Saint Kitts and Nevis 159 159
Saint Lucia 159 159
Saint Vincent and the Grenadines 159 159
San Marino 159 159
Sao Tome and Principe 159 159
Saudi Arabia 159 159
Senegal 159 159
Serbia 159 159
Seychelles 159 159
Sierra Leone 159 159
Singapore 159 159
Slovakia 159 159
Slovenia 159 159
Somalia 159 159
South Africa 159 159
South Sudan 159 159
Spain 159 159
Sri Lanka 159 159
Sudan 159 159
Suriname 159 159
Sweden 159 159
Switzerland 159 159
Syria 159 159
Taiwan* 159 159
Tajikistan 159 159
Tanzania 159 159
Thailand 159 159
Timor-Leste 159 159
Togo 159 159
Trinidad and Tobago 159 159
Tunisia 159 159
Turkey 159 159
Uganda 159 159
Ukraine 159 159
United Arab Emirates 159 159
United Kingdom 1749 1749
Uruguay 159 159
US 159 159
US_state 518499 518499
Uzbekistan 159 159
Venezuela 159 159
Vietnam 159 159
West Bank and Gaza 159 159
Western Sahara 159 159
Yemen 159 159
Zambia 159 159
Zimbabwe 159 159
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some counties are not summarized on the country level. collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

# Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case was recorded for each Country/Province. NOTE: Non-US countries may have weird provinces. For example, France is summarized at the country level but also has 3 provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
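The merge-then-subtract step above (group, find the first date past 100 cases, join back, difference) can be sketched in pandas with a toy frame (region names, dates, and counts are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Country.Region": ["A", "A", "A", "B", "B", "B"],
    "Date.numeric":   [1, 2, 3, 1, 2, 3],
    "Total_confirmed_cases": [50, 150, 400, 90, 95, 120],
})

# First date on which each region exceeded 100 cases
case100 = (
    df[df["Total_confirmed_cases"] > 100]
    .groupby("Country.Region", as_index=False)["Date.numeric"]
    .min()
    .rename(columns={"Date.numeric": "case100_date"})
)

# Join the threshold date back and express each observation relative to it
df = df.merge(case100, on="Country.Region", how="left")
df["Days_since_100"] = df["Date.numeric"] - df["case100_date"]
```

Aligning every region to its own day 0 is what makes the growth curves directly comparable.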

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 6505
Alaska 1364
Arizona 1573
Arkansas 6930
California 5985
Colorado 5871
Connecticut 934
Delaware 395
Diamond Princess 104
District of Columbia 105
Florida 6810
Georgia 15484
Grand Princess 105
Guam 105
Hawaii 537
Idaho 3194
Illinois 9202
Indiana 8953
Iowa 8680
Kansas 7445
Kentucky 10425
Louisiana 6425
Maine 1644
Maryland 2465
Massachusetts 1548
Michigan 8024
Minnesota 7696
Mississippi 7982
Missouri 9414
Montana 2894
Nebraska 5803
Nevada 1299
New Hampshire 1115
New Jersey 2345
New Mexico 2836
New York 6005
North Carolina 9536
North Dakota 3592
Northern Mariana Islands 90
Ohio 8469
Oklahoma 6655
Oregon 3263
Pennsylvania 6611
Puerto Rico 105
Rhode Island 622
South Carolina 4618
South Dakota 4534
Tennessee 9211
Texas 20378
Utah 1422
Vermont 1500
Virgin Islands 105
Virginia 12165
Washington 4087
West Virginia 4612
Wisconsin 6578
Wyoming 2056
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question dataset x y color facet pch dimensions
comparative_longitudinal_case_trend long time log_cases geography none (case type?) case_type [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend long time cases geography case_type ? [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend wide time mortality rate geography none none [15, 50, 4] geography
death vs total correlation wide cases deaths geography none none [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in case/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in case/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? - > apply lm from each region starting from 100th case. m, b associated with each.
    # input: geographical regsion, logcase vs day (100th case)
    # output: m, b for each geographical region ID
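The per-region lm idea sketched in these comments (fit log10(cases) against days since the 100th case and keep the slope m and intercept b for each region) can be illustrated in Python with synthetic data:

```python
import math

def fit_log_linear(days, cases):
    """Least-squares fit of log10(cases) = m * day + b.
    Returns (m, b), the per-region parameters described above."""
    logs = [math.log10(c) for c in cases]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs)) / \
        sum((x - mean_x) ** 2 for x in days)
    b = mean_y - m * mean_x
    return m, b

# Perfectly exponential synthetic data recovers the generating parameters:
# 100 cases on day 0, growing 10**0.3 (~2x) per day
days = [0, 1, 2, 3, 4]
cases = [100 * 10 ** (0.3 * d) for d in days]
m, b = fit_log_linear(days, cases)
```

In the actual analysis the same fit would be applied per geographical region, restricted to observations with Days_since_100 >= 0.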



#total/death on same plot- they differ by 2 logs, so when plotting log, use pch. When plotting absolute, need to use free scales
#when plotting death and case on same, melt. 

#CoronaCases - > filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 78436
China 159
Diamond Princess 140
Korea, South 130
Japan 129
Italy 127
Iran 124
Singapore 121
France 120
Germany 120
Spain 119
US 117
Switzerland 116
United Kingdom 116
Belgium 115
Netherlands 115
Norway 115
Sweden 115
Austria 113
Malaysia 112
Australia 111
Bahrain 111
Denmark 111
Canada 110
Qatar 110
Iceland 109
Brazil 108
Czechia 108
Finland 108
Greece 108
Iraq 108
Israel 108
Portugal 108
Slovenia 108
Egypt 107
Estonia 107
India 107
Ireland 107
Kuwait 107
Philippines 107
Poland 107
Romania 107
Saudi Arabia 107
Indonesia 106
Lebanon 106
Pakistan 106
San Marino 106
Thailand 106
Chile 105
Luxembourg 104
Peru 104
Russia 104
Ecuador 103
Mexico 103
Slovakia 103
South Africa 103
United Arab Emirates 103
Armenia 102
Colombia 102
Croatia 102
Panama 102
Serbia 102
Taiwan* 102
Turkey 102
Argentina 101
Bulgaria 101
Latvia 101
Uruguay 101
Algeria 100
Costa Rica 100
Dominican Republic 100
Hungary 100
Andorra 99
Bosnia and Herzegovina 99
Jordan 99
Lithuania 99
Morocco 99
New Zealand 99
North Macedonia 99
Vietnam 99
Albania 98
Cyprus 98
Malta 98
Moldova 98
Brunei 97
Burkina Faso 97
Sri Lanka 97
Tunisia 97
Ukraine 96
Azerbaijan 95
Ghana 95
Kazakhstan 95
Oman 95
Senegal 95
Venezuela 95
Afghanistan 94
Cote d’Ivoire 94
Cuba 93
Mauritius 93
Uzbekistan 93
Cambodia 92
Cameroon 92
Honduras 92
Nigeria 92
West Bank and Gaza 92
Belarus 91
Georgia 91
Bolivia 90
Kosovo 90
Kyrgyzstan 90
Montenegro 90
Congo (Kinshasa) 89
Kenya 88
Niger 87
Guinea 86
Rwanda 86
Trinidad and Tobago 86
Paraguay 85
Bangladesh 84
Djibouti 82
El Salvador 81
Guatemala 80
Madagascar 79
Mali 78
Congo (Brazzaville) 75
Jamaica 75
Gabon 73
Somalia 73
Tanzania 73
Ethiopia 72
Burma 71
Sudan 70
Liberia 69
Maldives 67
Equatorial Guinea 66
Cabo Verde 64
Sierra Leone 62
Guinea-Bissau 61
Togo 61
Zambia 60
Eswatini 59
Chad 58
Tajikistan 57
Haiti 55
Sao Tome and Principe 55
Benin 53
Nepal 53
Uganda 53
Central African Republic 52
South Sudan 52
Guyana 50
Mozambique 49
Yemen 45
Mongolia 44
Mauritania 41
Nicaragua 41
Malawi 35
Syria 35
Zimbabwe 33
Bahamas 32
Libya 32
Comoros 30
Suriname 22
Angola 19
Eritrea 14
Burundi 13
Monaco 7
Namibia 4
# Pick top 15 countries with data
max_colors<-12
# find a way to fix this- China has different provinces. Plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 159
Korea, South 130
Japan 129
Italy 127
Iran 124
Singapore 121
France 120
Germany 120
Spain 119
US 117
Switzerland 116
United Kingdom 116
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-02-07        18299
## 2    Afghanistan           <NA> <NA> 2020-04-13        18365
## 3    Afghanistan           <NA> <NA> 2020-02-08        18300
## 4    Afghanistan           <NA> <NA> 2020-06-05        18418
## 5    Afghanistan           <NA> <NA> 2020-06-04        18417
## 6    Afghanistan           <NA> <NA> 2020-02-06        18298
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      0                     0            NaN
## 2                     21                   665     0.03157895
## 3                      0                     0            NaN
## 4                    309                 18969     0.01628974
## 5                    300                 18054     0.01661682
## 6                      0                     0            NaN
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                      -Inf                       -Inf        18348
## 2                  2.822822                   1.322219        18348
## 3                      -Inf                       -Inf        18348
## 4                  4.278044                   2.489958        18348
## 5                  4.256573                   2.477121        18348
## 6                      -Inf                       -Inf        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1            -49  NA   NA         NA                           NA
## 2             17  NA   NA         NA                           NA
## 3            -48  NA   NA         NA                           NA
## 4             70  NA   NA         NA                           NA
## 5             69  NA   NA         NA                           NA
## 6            -50  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
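
write_plot itself is defined upstream and not shown in this section. A minimal sketch consistent with the file paths printed above (save the ggplot object under its own name via ggsave and return the path); the dimensions and exact naming are assumptions:

```r
library(ggplot2)

write_plot <- function(plot_obj, wd) {
  # Name the file after the plot object, mirroring the printed paths above
  outfile <- file.path(wd, paste0(deparse(substitute(plot_obj)), ".png"))
  ggsave(filename = outfile, plot = plot_obj, width = 10, height = 8)
  print(outfile)
  invisible(outfile)
}
```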
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

  1. Find the time at which the case vs date curve becomes linear in each plot
  2. Fit a linear model for each state

# What is the predicted # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually swap New York out for Maryland
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)
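
find_linear_index (applied above via ddply) is defined upstream; a sketch of the underlying idea, assuming it returns the numeric date at which the discrete second difference of the case curve first flattens out (the tolerance and exact rule here are assumptions, not the real upstream code):

```r
find_linear_index <- function(df, tol = 0.05) {
  df <- df[order(df$Date), ]
  y  <- df$Total_confirmed_cases_perstate
  d2 <- diff(y, differences = 2)  # discrete second derivative of the curve
  # first index where curvature is small relative to the local case level
  idx <- which(abs(d2) / pmax(y[-(1:2)], 1) < tol)[1]
  if (is.na(idx)) idx <- 1
  as.numeric(df$Date[idx + 2])    # numeric date, comparable via as.numeric(Date)
}
```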

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }
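
A note on the fit above: lm() coerces the Date column to its numeric representation (days since 1970-01-01), so the slope m is in cases per day and pairs directly with geom_abline on a date axis. A quick check:

```r
# Slope of a linear fit against a Date predictor is change per day
d <- data.frame(Date = as.Date("2020-04-01") + 0:4,
                y    = c(10, 20, 30, 40, 50))
coef(lm(y ~ Date, data = d))["Date"]
```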

# Plot top-state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))
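
prediction_model is an upstream helper; given slope m and intercept b from the per-state fit, it presumably evaluates the fitted line at the requested numeric days. A sketch whose column names match how the result is rbind-ed below (the real helper may differ):

```r
prediction_model <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}
```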

predicted_days_df<-data.frame(matrix(ncol=3))
names(predicted_days_df)<-c("Province.State","days","Total_confirmed_cases_perstate")

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the predicted trajectory of COVID-19 based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long until x number of cases? When will the plateau happen? Are any effects of social distancing (and its delays) noticeable?

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)
# Toy data to exercise train_lm: two states with different linear relationships
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
# A single pooled fit, for comparison with the per-state fits from train_lm
lm(data = tmp,formula = y~x)

train_lm<-function(input_data,subset_column,predictor,formula_input){
  # Fit one lm per group, then collect the coefficients into a data frame
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # plyr::rename takes c(old = new): rename the intercept/slope columns to b/m
  case_models.parameters <- plyr::rename(case_models.parameters,
                                         setNames(c("b","m"), c("(Intercept)", predictor)))
  return(case_models.parameters)
}

train_lm(tmp, "state", predictor = "x", formula_input = y ~ x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# actual_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# actual_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(actual_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:

* Recreation
* Workplace
* Residence
* Parks
* Grocery

Data from https://www.google.com/covid19/mobility/
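
The scraped mobility columns arrive as percent strings (e.g. "-45%"), which is why the commented-out conversions and the later gsub calls exist; a compact sketch of the conversion:

```r
# Strip the "%" suffix and coerce to numeric, e.g. "-45%" -> -45
pct_to_num <- function(x) as.numeric(gsub("%", "", x, fixed = TRUE))

# e.g., applied across the mobility measure columns:
# mobility[mobility_measures] <- lapply(mobility[mobility_measures], pct_to_num)
```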

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases per 100 people (Today)")+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
New Hampshire Parks 0.9477267 -20.0
Hawaii Transit 0.9337007 -89.0
Vermont Parks 0.9245249 -35.5
Maine Transit -0.9197347 -50.0
Hawaii Parks 0.9011090 -72.0
Connecticut Grocery_Pharmacy -0.8815014 -6.0
Utah Transit -0.8584685 -18.0
Utah Residential -0.8298114 12.0
South Dakota Parks 0.7868708 -26.0
Rhode Island Workplace -0.7691260 -39.5
Arizona Residential 0.7670081 13.0
Arizona Grocery_Pharmacy -0.7597987 -15.0
Connecticut Transit -0.7521408 -50.0
Massachusetts Workplace -0.7494943 -39.0
Alaska Residential 0.7461094 13.0
Nevada Transit -0.7340523 -20.0
Alaska Workplace -0.7164857 -33.0
Idaho Residential -0.7102565 11.0
Wyoming Parks -0.6773507 -4.0
North Dakota Parks 0.6591628 -34.0
New York Workplace -0.6546234 -34.5
Hawaii Workplace -0.6439550 -46.0
Vermont Grocery_Pharmacy -0.6289682 -25.0
Rhode Island Retail_Recreation -0.6280705 -45.0
Arkansas Parks 0.6097123 -12.0
Rhode Island Residential -0.5981837 18.5
New Jersey Parks -0.5935765 -6.0
New York Retail_Recreation -0.5912506 -46.0
Maine Workplace -0.5871386 -30.0
Utah Parks -0.5868620 17.0
Nebraska Workplace 0.5763620 -32.0
Alaska Grocery_Pharmacy -0.5519294 -7.0
New York Parks 0.5307795 20.0
Hawaii Grocery_Pharmacy 0.5214587 -34.0
Connecticut Retail_Recreation -0.5174922 -45.0
New Jersey Workplace -0.5169103 -44.0
Maine Parks 0.5161871 -31.0
Utah Workplace -0.5129174 -37.0
Connecticut Residential 0.5069468 14.0
Massachusetts Retail_Recreation -0.4929561 -44.0
New Mexico Grocery_Pharmacy -0.4843827 -11.0
Arizona Retail_Recreation -0.4835134 -42.5
South Carolina Parks -0.4803608 -23.0
New Jersey Grocery_Pharmacy -0.4784647 2.5
Connecticut Workplace -0.4698748 -39.0
Nebraska Residential -0.4659245 14.0
Arizona Transit 0.4658453 -38.0
New Mexico Residential 0.4623380 13.5
Maryland Workplace -0.4580481 -35.0
Rhode Island Parks 0.4463111 52.0
Iowa Parks -0.4382099 28.5
West Virginia Parks 0.4336659 -33.0
Kentucky Parks -0.4309633 28.5
Missouri Residential -0.4301443 13.0
Pennsylvania Workplace -0.4226384 -36.0
North Dakota Retail_Recreation -0.4217110 -42.0
Vermont Residential 0.4198163 11.5
Illinois Transit -0.4137696 -31.0
Maryland Grocery_Pharmacy -0.4091531 -10.0
New Jersey Retail_Recreation -0.4079658 -62.5
Hawaii Retail_Recreation 0.4057813 -56.0
Massachusetts Grocery_Pharmacy -0.4023713 -7.0
Pennsylvania Retail_Recreation -0.4001715 -45.0
New Jersey Transit -0.4001567 -50.5
New Mexico Parks 0.3828741 -31.5
New Hampshire Residential -0.3828116 14.0
New Mexico Retail_Recreation -0.3783281 -42.5
Montana Parks -0.3765165 -58.0
Alabama Workplace -0.3674324 -29.0
New York Transit -0.3641953 -48.0
Hawaii Residential 0.3580545 19.0
Wisconsin Transit -0.3534841 -23.5
Florida Residential 0.3519461 14.0
Montana Residential 0.3505699 14.0
Montana Transit 0.3472663 -41.0
Nevada Workplace -0.3472037 -40.0
Alabama Grocery_Pharmacy -0.3457523 -2.0
Oregon Parks -0.3456409 16.5
Michigan Parks 0.3437456 28.5
Nebraska Grocery_Pharmacy 0.3431356 -0.5
Virginia Transit -0.3329746 -32.5
Maryland Retail_Recreation -0.3289390 -39.0
Maine Retail_Recreation -0.3288448 -42.0
Nevada Residential 0.3253393 17.0
South Carolina Workplace 0.3240706 -30.0
Alaska Retail_Recreation 0.3224827 -39.0
Colorado Residential 0.3222051 14.0
North Dakota Workplace 0.3091853 -40.0
California Parks -0.3072642 -38.5
Arkansas Retail_Recreation -0.3041952 -30.0
Vermont Workplace -0.2994840 -43.0
California Residential 0.2988130 14.0
Illinois Workplace -0.2985793 -30.5
Pennsylvania Parks 0.2846186 12.0
Nevada Retail_Recreation -0.2839649 -43.0
Minnesota Transit -0.2811077 -28.5
Idaho Workplace -0.2788331 -29.0
Vermont Retail_Recreation 0.2783981 -57.0
California Transit -0.2767102 -42.0
West Virginia Grocery_Pharmacy -0.2752521 -6.0
Pennsylvania Grocery_Pharmacy -0.2707694 -6.0
Missouri Workplace 0.2707440 -29.0
Kansas Workplace 0.2703750 -32.5
Maryland Residential 0.2669150 15.0
Rhode Island Grocery_Pharmacy 0.2621455 -7.5
Idaho Transit -0.2596099 -30.0
North Carolina Grocery_Pharmacy 0.2567758 0.0
North Carolina Workplace 0.2561866 -31.0
Idaho Grocery_Pharmacy -0.2554313 -5.5
Washington Grocery_Pharmacy 0.2528150 -7.0
Tennessee Parks -0.2514103 10.5
West Virginia Workplace 0.2499720 -33.0
Tennessee Workplace -0.2460965 -31.0
Rhode Island Transit -0.2460045 -56.0
Illinois Parks 0.2457531 26.5
Minnesota Parks 0.2456429 -9.0
Alabama Transit -0.2449807 -36.5
Alabama Parks 0.2444336 -1.0
South Dakota Workplace 0.2427141 -35.0
Georgia Grocery_Pharmacy -0.2417446 -10.0
Wisconsin Parks 0.2409644 51.5
Tennessee Residential 0.2377122 11.5
Florida Parks -0.2355734 -43.0
Oregon Residential -0.2310860 10.5
Missouri Parks 0.2281155 0.0
Indiana Parks -0.2273583 29.0
New York Grocery_Pharmacy -0.2264106 8.0
Nebraska Transit -0.2234621 -9.0
Georgia Retail_Recreation -0.2175143 -41.0
Idaho Parks 0.2124153 -19.0
Wyoming Grocery_Pharmacy -0.2114884 -10.0
Georgia Workplace -0.2088598 -33.5
Arizona Parks -0.2077651 -44.5
Oregon Transit 0.2073982 -27.5
Texas Workplace 0.2073343 -32.0
North Dakota Grocery_Pharmacy -0.2050117 -8.0
Connecticut Parks 0.2048228 43.0
Illinois Residential 0.2016834 14.0
South Dakota Residential 0.1945860 15.0
North Carolina Transit 0.1934118 -32.0
Kansas Grocery_Pharmacy -0.1931308 -14.5
Colorado Parks -0.1873236 2.0
Virginia Residential 0.1865800 14.0
Wyoming Workplace -0.1850065 -31.0
Wisconsin Residential -0.1810737 14.0
Mississippi Grocery_Pharmacy -0.1797100 -8.0
New Hampshire Retail_Recreation -0.1779615 -41.0
Oklahoma Parks 0.1775129 -18.5
Pennsylvania Transit -0.1767528 -42.0
Texas Residential -0.1760042 15.0
Oregon Retail_Recreation 0.1738406 -40.5
Oregon Grocery_Pharmacy -0.1715383 -7.0
North Carolina Residential 0.1713397 13.0
Ohio Transit 0.1693449 -28.0
Indiana Residential 0.1677119 12.0
Massachusetts Parks 0.1624120 39.0
Kentucky Residential 0.1594478 12.0
Kentucky Grocery_Pharmacy 0.1583371 4.0
Kentucky Transit -0.1579524 -31.0
Utah Retail_Recreation -0.1578711 -40.0
Florida Transit -0.1537387 -49.0
North Dakota Residential -0.1533154 17.0
Montana Retail_Recreation 0.1524484 -51.0
Mississippi Residential 0.1516953 13.0
Oklahoma Residential -0.1445414 15.0
South Carolina Transit -0.1422373 -45.0
New Mexico Transit 0.1414688 -38.5
Mississippi Parks -0.1369984 -25.0
New Jersey Residential 0.1362882 18.0
West Virginia Residential -0.1336602 11.0
Minnesota Retail_Recreation 0.1320453 -40.5
Utah Grocery_Pharmacy 0.1311923 -4.0
Wisconsin Grocery_Pharmacy 0.1277613 -1.0
North Carolina Retail_Recreation 0.1270227 -34.0
Iowa Transit 0.1263650 -24.0
Virginia Grocery_Pharmacy -0.1260133 -8.0
North Dakota Transit 0.1245057 -48.0
Mississippi Retail_Recreation -0.1219259 -40.0
Kansas Transit -0.1216098 -23.0
Indiana Retail_Recreation 0.1204022 -38.0
Kansas Parks 0.1198176 72.0
Texas Parks 0.1195841 -42.0
Vermont Transit -0.1163582 -63.0
Texas Grocery_Pharmacy 0.1152372 -14.0
Michigan Workplace -0.1121000 -40.0
Idaho Retail_Recreation -0.1115842 -40.0
Wisconsin Workplace -0.1107145 -31.0
New Hampshire Grocery_Pharmacy -0.1101549 -6.0
Oklahoma Grocery_Pharmacy -0.1073430 -0.5
Nebraska Retail_Recreation 0.1067249 -36.0
Iowa Grocery_Pharmacy -0.1037112 4.0
Oklahoma Workplace 0.1028192 -31.0
Minnesota Grocery_Pharmacy 0.1001441 -6.0
Virginia Parks 0.0990233 6.0
Alabama Retail_Recreation 0.0989716 -39.0
Missouri Transit -0.0988464 -24.5
Massachusetts Residential 0.0983701 15.0
Massachusetts Transit -0.0983541 -45.0
Iowa Workplace 0.0966075 -30.0
Maryland Transit -0.0955336 -39.0
Indiana Workplace 0.0951220 -34.0
Arkansas Workplace -0.0907779 -26.0
California Grocery_Pharmacy -0.0905986 -11.5
Alabama Residential -0.0900233 11.0
Arkansas Transit -0.0900018 -27.0
Wyoming Transit -0.0897314 -17.0
New York Residential 0.0888871 17.5
Georgia Transit -0.0883979 -35.0
Arkansas Residential -0.0861542 12.0
Michigan Transit 0.0843991 -46.0
Montana Workplace -0.0829065 -40.0
California Retail_Recreation -0.0786922 -44.0
Virginia Retail_Recreation -0.0779604 -35.0
Washington Retail_Recreation 0.0740991 -42.0
West Virginia Retail_Recreation -0.0720076 -38.5
Ohio Grocery_Pharmacy 0.0710103 0.0
Oregon Workplace 0.0703615 -31.0
Kentucky Retail_Recreation 0.0692770 -29.0
Ohio Residential 0.0689441 14.0
Pennsylvania Residential 0.0680125 15.0
West Virginia Transit -0.0668808 -45.0
South Carolina Residential -0.0635251 12.0
Michigan Retail_Recreation -0.0619322 -53.0
South Dakota Transit -0.0605721 -40.0
Florida Retail_Recreation 0.0600155 -43.0
Virginia Workplace -0.0590807 -32.0
Minnesota Residential -0.0579750 17.0
Tennessee Transit 0.0572473 -32.0
Iowa Retail_Recreation 0.0571980 -38.0
North Carolina Parks -0.0564212 7.0
South Dakota Retail_Recreation -0.0562015 -39.0
Michigan Residential 0.0558708 15.0
New Hampshire Transit 0.0557898 -57.0
Washington Transit -0.0545249 -33.5
South Carolina Grocery_Pharmacy -0.0542759 1.0
Georgia Parks 0.0539916 -6.0
Texas Retail_Recreation 0.0533583 -40.0
Indiana Transit 0.0517161 -29.0
Arkansas Grocery_Pharmacy -0.0514819 3.0
New Mexico Workplace 0.0458555 -34.0
Mississippi Workplace 0.0456589 -33.0
Colorado Retail_Recreation -0.0447725 -44.0
Ohio Workplace -0.0439973 -35.0
Florida Workplace 0.0427143 -33.0
Arizona Workplace -0.0424958 -35.0
Illinois Retail_Recreation 0.0415781 -40.0
New Hampshire Workplace 0.0364884 -37.0
Missouri Grocery_Pharmacy -0.0361160 2.0
Ohio Retail_Recreation 0.0353986 -36.0
Tennessee Retail_Recreation -0.0347219 -30.0
Wyoming Retail_Recreation 0.0343831 -39.0
Washington Workplace 0.0339872 -38.0
Kentucky Workplace -0.0324807 -36.5
Nebraska Parks 0.0314042 55.5
Colorado Grocery_Pharmacy -0.0313079 -17.0
Wyoming Residential 0.0305690 12.5
Mississippi Transit 0.0302557 -38.5
Texas Transit 0.0300339 -41.5
Oklahoma Retail_Recreation -0.0296660 -31.0
Nevada Parks 0.0281423 -12.5
Maine Residential -0.0262934 11.0
Washington Residential -0.0249645 13.0
Maine Grocery_Pharmacy -0.0244000 -13.0
Iowa Residential 0.0236516 13.0
Georgia Residential -0.0231487 13.0
Oklahoma Transit 0.0206542 -26.0
Missouri Retail_Recreation -0.0203047 -36.0
Illinois Grocery_Pharmacy -0.0195896 2.0
California Workplace 0.0191664 -36.0
Colorado Transit 0.0186400 -36.0
Montana Grocery_Pharmacy 0.0172809 -16.0
South Dakota Grocery_Pharmacy -0.0167764 -9.0
Kansas Residential -0.0156304 13.0
Maryland Parks 0.0152397 27.0
Florida Grocery_Pharmacy -0.0148329 -14.0
Michigan Grocery_Pharmacy 0.0145288 -11.0
South Carolina Retail_Recreation -0.0128589 -35.0
Minnesota Workplace -0.0110191 -33.0
Tennessee Grocery_Pharmacy 0.0060957 6.0
Ohio Parks -0.0058202 67.5
Nevada Grocery_Pharmacy -0.0050303 -12.5
Washington Parks -0.0048955 -3.5
Wisconsin Retail_Recreation -0.0040934 -44.0
Kansas Retail_Recreation 0.0030842 -38.0
Indiana Grocery_Pharmacy 0.0030371 -5.5
Colorado Workplace -0.0014878 -39.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to documents committed and pushed to GitHub. They are provided as a convenience, but note that this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
| name | handle | link |
| --- | --- | --- |
| World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png |
| World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png |
| World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png |
| World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png |
| City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png |
| City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png |
| City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png |
| City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png |
| City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png |
| State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png |
| State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png |
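The manifest assembly above pastes each row with `" | "` separators; the same idea can produce a complete markdown table in one step. A minimal, self-contained sketch (using a one-row stand-in for `deliverable_manifest`, with the same column names as above):

```r
# Sketch: render a deliverable manifest as markdown table rows.
# `deliverable_manifest` here is a one-row stand-in for the data
# frame built above (columns: name, plot_handle, link).
deliverable_manifest <- data.frame(
  name = "World total & death cases, longitudinal",
  plot_handle = "Corona_Cases.world.long.plot",
  link = "https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png",
  stringsAsFactors = FALSE
)

# Join each row's fields with " | ", then prepend a header and rule
md_rows <- apply(deliverable_manifest, 1, paste, collapse = " | ")
md_table <- c("name | handle | link", "--- | --- | ---", md_rows)
writeLines(md_table)
```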

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, are still in the log-linear phase). The mortality rate (deaths/confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).
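"Still in the log-linear phase" can be checked by regressing log cases on time and inspecting the fit. A sketch on simulated cumulative counts (in the analysis above, this would be a single region's series, not these toy values):

```r
# Sketch: is growth still exponential (log-linear)?
# Simulated cumulative cases with ~15% daily growth over 30 days.
set.seed(1)
days <- 1:30
cases <- round(10 * exp(0.15 * days))

# Fit a line to log(cases); a near-perfect fit (R^2 close to 1)
# is consistent with the log-linear phase.
fit <- lm(log(cases) ~ days)
r2 <- summary(fit)$r.squared
r2
```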

See table below for detailed breakdown.

| Question | Answer |
| --- | --- |
| What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California appears to have one of the better correlations, but the results are a mixed bag. |
| What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are overall log-linear for most countries, with a trailing off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region, but is about 3% overall. |
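The mortality rate quoted above is cumulative deaths divided by cumulative confirmed cases. A minimal sketch with toy numbers (the column names mirror the `Corona_Cases` naming used in this analysis, but the values are made up):

```r
# Sketch: per-region mortality rate = deaths / confirmed cases.
# Toy numbers only; not the source data.
toy <- data.frame(
  Province.State = c("New York", "Maryland"),
  Total_confirmed_cases = c(300000, 30000),
  Total_deaths = c(18000, 1200)
)
toy$mortality_rate <- toy$Total_deaths / toy$Total_confirmed_cases
toy$mortality_rate  # one rate per region
```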

END

End: ##—— Mon Jun 29 14:06:14 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
# install.packages("maps")  # run once if not already installed
library(maps)
library(dplyr)
library(ggplot2)
pa_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(pa_counties)

ggplot(pa_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
pa_counties$cases <- seq_len(nrow(pa_counties))  # dummy counts, one per polygon point
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()
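The polygon attempts above fail because the case data has no polygon grouping. A working version of the choropleth idea, with simulated county-level counts merged onto `map_data` polygons (requires the ggplot2 and maps packages; the counts are made up):

```r
# Sketch: choropleth of simulated case counts on PA county polygons.
library(ggplot2)
library(maps)

pa <- map_data("county", "pennsylvania")

# One simulated case count per county, keyed on subregion
set.seed(42)
county_names <- unique(pa$subregion)
counts <- data.frame(subregion = county_names,
                     cases = rpois(length(county_names), 500))

pa_cases <- merge(pa, counts, by = "subregion")
# merge() scrambles row order; polygons must be drawn in point order
pa_cases <- pa_cases[order(pa_cases$group, pa_cases$order), ]

ggplot(pa_cases, aes(long, lat, group = group, fill = cases)) +
  geom_polygon(colour = "grey50") +
  coord_quickmap()
```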


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map, c("region", "group")))

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as the label coordinate for country names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date == "2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

dplyr::inner_join(Corona_Cases.US_state, counties)  # sf registers an inner_join method for sf objects

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state, c("Long", "Lat")),  # coords are (x = Long, y = Lat)
                               data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # st_cast() expects an sf object; convert the sp object first

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net